Abstract: In [Haruna, T. and Nakajima, K., 2011. Physica D 
240, 1370-1377], the authors introduced the duality between val- 
ues (words) and orderings (permutations) as a basis to discuss the 
relationship between information theoretic measures for finite- 
alphabet stationary stochastic processes and their permutation 
versions. It has been used to give a simple proof of the equality 
between the entropy rate and the permutation entropy rate for 
any finite-alphabet stationary stochastic process and show some 
results on the excess entropy and the transfer entropy for finite- 
alphabet stationary ergodic Markov processes. In this paper, 
we generalize our previous framework and show the equalities 
between various information theoretic complexity and coupling 
measures and their permutation versions. In particular, we prove 
the following two results within the realm of hidden Markov 
models with ergodic internal processes: the two permutation 
versions of the transfer entropy, the symbolic transfer entropy 
and the transfer entropy on rank vectors, are both equivalent to 
the transfer entropy if they are considered as the rates, and the 
directed information theory can be captured by the permutation 
entropy approach. 

Index Terms — Duality, Permutation Entropy, Excess Entropy, 
Transfer Entropy, Directed Information 



I. Introduction 

Recently, the permutation-information theoretic approach
to time series analysis proposed by Bandt and
Pompe [1] has become popular in various fields [2]. It has been
proved that the method of permutation is easy to implement 
relative to the other traditional methods, is computationally 
fast and is robust under the existence of noise 0, 0. 
However, if we turn our eyes to its theoretical side, few 
results are known for the permutation versions of information 
theoretic measures except the entropy rate. 

There are two approaches to introduce permutation into 
dynamical systems theory. The first approach was introduced 
by Bandt et al. [5]. Given a one-dimensional interval map,
they considered permutations induced by iterations of the
map. Each point in the interval is classified into one of $n!$
permutations according to the permutation defined by $n - 1$
iterations of the map starting from the point. Then, the
Shannon entropy of this partition (called the standard partition)
of the interval is taken and normalized by $n$. The quantity
obtained in the limit $n \to \infty$ is called the permutation entropy if
it exists. It was proved that the permutation entropy is equal to
the Kolmogorov-Sinai entropy for any piecewise monotone interval
map [6]. This approach based on the standard partitions
was extended by Q. 

The second approach is taken by Amigo et al. [0, (8). In 
this approach, given a measure-preserving map on a proba- 
bility space, first an arbitrary finite partition of the space is 
taken. This gives rise to a finite-alphabet stationary stochastic 
process. An arbitrary ordering is introduced on the alphabet 
and the permutations of the words of finite lengths can be 
naturally defined (see Section II below). It is proved that 
the Shannon entropy of the occurrence of the permutations 
of a fixed length normalized by the length converges in the 
limit of the large length of the permutations. The quantity 
obtained is called permutation entropy rate (also called metric 
permutation entropy) and is shown to be equal to the entropy 
rate of the process. By taking the limit of finer partitions of the 
measurable space, the permutation entropy rate of the measure-preserving
map is defined if the limit exists. Amigó [9] proved
that it exists and is equal to the Kolmogorov-Sinai entropy. 

In this paper, we restrict our attention to finite-alphabet 
stationary stochastic processes. Thus, we follow the second 
approach, namely, ordering on the alphabet is introduced 
arbitrarily. For quantities other than the entropy rate, three
results for finite-alphabet stationary stochastic Markov processes
have been shown in our previous work: the equality
between the excess entropy and the permutation excess entropy
[10], the equality between the mutual information expression
of the excess entropy and its permutation version [11], and the
equality between the transfer entropy rate and the symbolic
transfer entropy rate [12].

The purpose of this paper is to set up a theoretical frame- 
work to discuss permutation versions of many information 
theoretic measures other than the entropy rate. In particular, 
we generalize our previous results for finite-alphabet stationary 
ergodic Markov processes to output processes of finite-state 
finite-alphabet hidden Markov models with ergodic internal 
processes. With this generalization, the somewhat ad hoc proofs
in our previous work become systematic and greatly simplified.
This gives us easy access to quantities that have not yet been
considered in the permutation approach. In this paper, we shall
treat the following quantities: excess entropy [13], transfer
entropy [14], [15], momentary information transfer [16] and
directed information [17], [18].

This paper is organized as follows: In Section II, we briefly
review our previous result on the duality between words and
permutations, which is the basis for the succeeding results.
In Section III, we prove a lemma about finite-state finite-alphabet
hidden Markov models. In Section IV, we show
equalities between various information theoretic complexity
and coupling measures and their permutation versions that
hold for output processes of finite-state finite-alphabet hidden
Markov models with ergodic internal processes. In Section V,
we discuss how our results are related to the previous work in
the literature.

II. The duality between words and permutations 

In this section, we summarize the results from our previous
work [10] which will be used in this paper.

Let $A_n$ be a finite set consisting of natural numbers from 1
to $n$, called an alphabet. In this paper $A_n$ is considered as a
totally ordered set ordered by the usual 'less-than-or-equal-to'
relationship. When we emphasize the total order, we call $A_n$
an ordered alphabet.

The set of all permutations of length $L \geq 1$ is denoted
by $S_L$. Namely, $S_L$ is the set of all bijections $\pi$ on the
set $\{1, 2, \cdots, L\}$. For convenience, we sometimes denote a
permutation $\pi$ of length $L$ by a string $\pi(1)\pi(2) \cdots \pi(L)$. The
number of descents, places with $\pi(i) > \pi(i+1)$, of $\pi \in S_L$
is denoted by $\mathrm{Desc}(\pi)$. For example, if $\pi \in S_5$ is given by
$\pi(1)\pi(2)\pi(3)\pi(4)\pi(5) = 35142$, then $\mathrm{Desc}(\pi) = 2$.
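
The descent count can be illustrated with a minimal Python sketch. The function name `desc` and the choice to represent a permutation as a list of integers are assumptions made for illustration only.

```python
def desc(pi):
    """Count the descents of pi, i.e. the positions i with pi(i) > pi(i+1).

    pi is a permutation given as a list such as [3, 5, 1, 4, 2].
    """
    return sum(1 for a, b in zip(pi, pi[1:]) if a > b)


# The string 35142 from the text has descents at 5 > 1 and 4 > 2.
example = [3, 5, 1, 4, 2]
number_of_descents = desc(example)
```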

Let $A_n^L = \underbrace{A_n \times \cdots \times A_n}_{L}$ be the $L$-fold product of $A_n$.
A word of length $L \geq 1$ is an element of $A_n^L$. It is denoted
by $x_{1:L} := x_1 \cdots x_L := (x_1, \cdots, x_L) \in A_n^L$. We say that
the permutation type of a word $x_{1:L}$ is $\pi \in S_L$ if we have
$x_{\pi(i)} \leq x_{\pi(i+1)}$, and $\pi(i) < \pi(i+1)$ when $x_{\pi(i)} = x_{\pi(i+1)}$,
for $i = 1, 2, \cdots, L-1$. Namely, the permutation type of $x_{1:L}$
is the permutation of indices defined by re-ordering the symbols
$x_1, \cdots, x_L$ in increasing order. For example, the permutation
type of $x_{1:5} = 31212 \in A_3^5$ is $\pi(1)\pi(2)\pi(3)\pi(4)\pi(5) =
24351$ because $x_2 x_4 x_3 x_5 x_1 = 11223$.
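
As a hedged illustration of this definition, the permutation type can be computed with a stable sort of the indices: ties between equal symbols are then broken by increasing index, as the definition requires. The name `permutation_type` is an assumption of this sketch, not notation from the paper.

```python
def permutation_type(word):
    """Return the permutation type of a word as a list of 1-based indices.

    Python's sorted() is stable, so indices carrying equal symbols stay in
    increasing order, matching the tie-breaking rule in the definition.
    """
    order = sorted(range(len(word)), key=lambda i: word[i])
    return [i + 1 for i in order]


# The word 31212 from the text.
pi = permutation_type([3, 1, 2, 1, 2])
```

Reading the word along `pi` gives the symbols in non-decreasing order, which is the defining property.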

Let $\phi_{n,L} : A_n^L \to S_L$ be a map sending each word $x_{1:L}$ to
its permutation type $\pi = \phi_{n,L}(x_{1:L})$. We define another map
$\mu_{n,L} : \phi_{n,L}(A_n^L) \subseteq S_L \to A_n^L$ by the following procedure:

(i) Given a permutation $\pi \in \phi_{n,L}(A_n^L) \subseteq S_L$, we decompose
the sequence $\pi(1) \cdots \pi(L)$ of length $L$ into maximal
ascending subsequences. A subsequence $i_j \cdots i_{j+k}$ of
a sequence $i_1 \cdots i_L$ of length $L$ is called a maximal
ascending subsequence if it is ascending, namely,
$i_j < i_{j+1} < \cdots < i_{j+k}$, and neither $i_{j-1} i_j \cdots i_{j+k}$ nor
$i_j i_{j+1} \cdots i_{j+k+1}$ is ascending.

(ii) If $\pi(1) \cdots \pi(i_1)$, $\pi(i_1 + 1) \cdots \pi(i_2)$, $\cdots$, $\pi(i_{k-1} + 1) \cdots \pi(L)$
is the decomposition of $\pi(1) \cdots \pi(L)$ into maximal
ascending subsequences, then a word $x_{1:L} \in A_n^L$ is
defined by $x_{\pi(1)} = \cdots = x_{\pi(i_1)} = 1$, $x_{\pi(i_1+1)} = \cdots =
x_{\pi(i_2)} = 2$, $\cdots$, $x_{\pi(i_{k-1}+1)} = \cdots = x_{\pi(L)} = k$. We define
$\mu_{n,L}(\pi) = x_{1:L}$. Note that $\mathrm{Desc}(\pi) \leq n - 1$ because
$\pi$ is the permutation type of some word $y_{1:L} \in A_n^L$. Thus,
we have $k = \mathrm{Desc}(\pi) + 1 \leq n$. Hence, $\mu_{n,L}$ is well-defined
as a map from $\phi_{n,L}(A_n^L)$ to $A_n^L$.

By construction, we have $\phi_{n,L} \circ \mu_{n,L}(\pi) = \pi$ for all
$\pi \in \phi_{n,L}(A_n^L)$. To illustrate the construction of $\mu_{n,L}$, let us
consider a word $y_{1:5} = 21123 \in A_3^5$. The permutation type of
$y_{1:5}$ is $\pi(1)\pi(2)\pi(3)\pi(4)\pi(5) = 23145$. The decomposition
of 23145 into maximal ascending subsequences is 23, 145.
We obtain $\mu_{n,L}(\pi) = x_1 x_2 x_3 x_4 x_5 = 21122$ by putting
$x_2 x_3 x_1 x_4 x_5 = 11222$.
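
The two-step procedure above can be sketched in Python as follows. The function names `mu` and `permutation_type` are illustrative assumptions; the sketch walks along $\pi(1) \cdots \pi(L)$ and starts a new symbol at every descent, which is the same as labelling the maximal ascending subsequences by $1, 2, \cdots, k$.

```python
def permutation_type(word):
    """Permutation type of a word (stable sort of the 1-based indices)."""
    order = sorted(range(len(word)), key=lambda i: word[i])
    return [i + 1 for i in order]


def mu(pi):
    """Map a permutation to the word built from its maximal ascending runs.

    Every index in the j-th maximal ascending subsequence of pi receives
    the symbol j, so the largest symbol used is Desc(pi) + 1.
    """
    word = [0] * len(pi)
    symbol = 1
    for position, index in enumerate(pi):
        if position > 0 and pi[position - 1] > index:
            symbol += 1  # a descent closes the current ascending run
        word[index - 1] = symbol
    return word


# Example from the text: 23145 decomposes into 23 and 145.
word = mu([2, 3, 1, 4, 5])
```

Applying `permutation_type` to the result returns the original permutation, in line with $\phi_{n,L} \circ \mu_{n,L}(\pi) = \pi$.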

Theorem 1: (i) For any $\pi \in S_L$,

$$|\phi_{n,L}^{-1}(\pi)| = \binom{L + n - \mathrm{Desc}(\pi) - 1}{L},$$

where $\binom{a}{b} = 0$ if $a < b$.
(ii) Let us put $B_{n,L} := \{x_{1:L} \in A_n^L \mid \phi_{n,L}^{-1}(\pi) = \{x_{1:L}\}
\text{ for some } \pi \in S_L\}$ and $C_{n,L} := \{\pi \in
S_L \mid |\phi_{n,L}^{-1}(\pi)| = 1\}$. Then, $\phi_{n,L}$ restricted on $B_{n,L}$ is
a map into $C_{n,L}$, $\mu_{n,L}$ restricted on $C_{n,L}$ is a map into
$B_{n,L}$, and they form a pair of mutually inverse maps.
Furthermore, we have $B_{n,L} = \{x_{1:L} \in A_n^L \mid \forall\, 1 \leq i \leq
n - 1,\ \exists\, 1 \leq j < k \leq L \text{ s.t. } x_j = i + 1,\ x_k = i\}$ and
$C_{n,L} = \{\pi \in S_L \mid \mathrm{Desc}(\pi) = n - 1\}$.
Proof: The theorem is a recasting of statements in Lemma
5 and Theorem 9 in [10]. ■
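
For small $n$ and $L$, the counting formula in Theorem 1 (i) can be checked by brute force. The sketch below is only a sanity check under that exhaustive-enumeration assumption, not part of the proof; the helper names are illustrative.

```python
from collections import Counter
from itertools import permutations, product
from math import comb


def permutation_type(word):
    """Permutation type of a word as a tuple of 1-based indices."""
    order = sorted(range(len(word)), key=lambda i: word[i])
    return tuple(i + 1 for i in order)


def desc(pi):
    """Number of descents of pi."""
    return sum(1 for a, b in zip(pi, pi[1:]) if a > b)


def check_theorem1(n, L):
    """Compare the preimage sizes of phi_{n,L} with the binomial formula."""
    sizes = Counter(
        permutation_type(w) for w in product(range(1, n + 1), repeat=L)
    )
    for pi in permutations(range(1, L + 1)):
        # math.comb returns 0 when the lower argument exceeds the upper one,
        # which matches the convention stated in the theorem.
        if sizes.get(pi, 0) != comb(L + n - desc(pi) - 1, L):
            return False
    return True


formula_holds = check_theorem1(3, 4)
```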
Let $\mathbf{X} = \{X_1, X_2, \cdots\}$ be a finite-alphabet stationary
stochastic process, where each stochastic variable $X_i$ takes
its value in $A_n$. By the assumed stationarity, the probability
of the occurrence of any word $x_{1:L} \in A_n^L$ is time-shift
invariant: $\Pr\{X_1 = x_1, \cdots, X_L = x_L\} = \Pr\{X_{k+1} =
x_1, \cdots, X_{k+L} = x_L\}$ for any $k, L \geq 1$. Hence, it makes
sense to define it without referring to the starting time.
We denote the probability of the occurrence of a word
$x_{1:L} \in A_n^L$ by $p(x_{1:L}) = p(x_1 \cdots x_L)$. The probability
of the occurrence of a permutation $\pi \in S_L$ is given by

$$p(\pi) = \sum_{x_{1:L} \in \phi_{n,L}^{-1}(\pi)} p(x_{1:L}).$$

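For a finite-alphabet stationary stochastic process $\mathbf{X}$ over
the alphabet $A_n$, we define

$$\alpha_{\mathbf{X},L} := \sum_{\pi \in S_L,\ |\phi_{n,L}^{-1}(\pi)| > 1} p(\pi)$$

and

$$\beta_{x,\mathbf{X},L} := \Pr\{x_{1:N} \in A_n^N \mid x_j \neq x \text{ for any } 1 \leq j \leq N\} = \sum_{x_j \neq x,\ 1 \leq j \leq N} p(x_1 \cdots x_N),$$

where $L \geq 1$, $x \in A_n$, $N = \lfloor L/2 \rfloor$ and $\lfloor a \rfloor$ is the largest
integer not greater than $a$.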

Lemma 2: Let $\mathbf{X}$ be a finite-alphabet stationary stochastic
process and $\epsilon$ be a positive real number. If $\beta_{x,\mathbf{X},L} < \epsilon$ for any
$x \in A_n$, then we have $\alpha_{\mathbf{X},L} < 2n\epsilon$.

Proof: The claim follows from Theorem 1 (ii). See
Lemma 12 in [10] for the complete proof. ■
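
To make $\alpha_{\mathbf{X},L}$ and $\beta_{x,\mathbf{X},L}$ concrete, the following sketch evaluates them by exhaustive enumeration for an i.i.d. process, which is a simplifying assumption chosen here for illustration (the lemma itself covers any finite-alphabet stationary process). For an i.i.d. process with symbol probabilities `probs`, $\beta_{x,\mathbf{X},L} = (1 - p_x)^N$.

```python
from collections import defaultdict
from itertools import product
from math import prod


def permutation_type(word):
    """Permutation type of a word as a tuple of 1-based indices."""
    order = sorted(range(len(word)), key=lambda i: word[i])
    return tuple(i + 1 for i in order)


def alpha_beta_iid(probs, L):
    """Return (alpha, [beta_x for each symbol x]) for an i.i.d. process."""
    n = len(probs)
    mass = defaultdict(float)  # p(pi)
    size = defaultdict(int)    # number of words with permutation type pi
    for w in product(range(1, n + 1), repeat=L):
        pi = permutation_type(w)
        mass[pi] += prod(probs[x - 1] for x in w)
        size[pi] += 1
    # alpha sums p(pi) over permutations with more than one preimage
    alpha = sum(m for pi, m in mass.items() if size[pi] > 1)
    N = L // 2
    beta = [(1.0 - p) ** N for p in probs]
    return alpha, beta


alpha, beta = alpha_beta_iid([0.5, 0.5], 6)
```

For the fair binary process with $L = 6$, only the identity permutation has more than one preimage (the seven non-decreasing words), so $\alpha = 7/64$, which is well below $2n \max_x \beta_x = 0.5$.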

III. A result on finite-state finite-alphabet hidden Markov models

A finite-state finite-alphabet hidden Markov model (in short,
HMM) [19] is a quadruple $(\Sigma, A, \{T^{(a)}\}_{a \in A}, \mu)$, where $\Sigma$ and
$A$ are finite sets called the state set and the alphabet, respectively,
$\{T^{(a)}\}_{a \in A}$ is a family of $|\Sigma| \times |\Sigma|$ matrices indexed by
elements of $A$, where $|\Sigma|$ is the size of the state set $\Sigma$, and $\mu$ is a
probability distribution on the set $\Sigma$. The following conditions
must be satisfied:
(i) $T^{(a)}_{ss'} \geq 0$ for any $s, s' \in \Sigma$ and $a \in A$,
(ii) $\sum_{s', a} T^{(a)}_{ss'} = 1$ for any $s \in \Sigma$,
(iii) and $\mu(s') = \sum_{s, a} \mu(s) T^{(a)}_{ss'}$ for any $s' \in \Sigma$.

Any probability distribution satisfying the condition (iii) is
called a stationary distribution. The $|\Sigma| \times |\Sigma|$ matrix $T :=
\sum_{a \in A} T^{(a)}$ is called the state transition matrix. The triple
$(\Sigma, T, \mu)$ defines the underlying Markov chain. Note that the
condition (iii) is equivalent to the condition (iii') $\mu(s') =
\sum_{s} \mu(s) T_{ss'}$.

Two finite-alphabet stationary processes are induced by
a HMM $(\Sigma, A, \{T^{(a)}\}_{a \in A}, \mu)$. One is solely determined by
the underlying Markov chain. It is called the internal process
and is denoted by $\mathbf{S} = \{S_1, S_2, \cdots\}$. The alphabet for
$\mathbf{S}$ is $\Sigma$. The joint probability distributions which characterize
$\mathbf{S}$ are given by $\Pr\{S_1 = s_1, S_2 = s_2, \cdots, S_L =
s_L\} := \mu(s_1) T_{s_1 s_2} \cdots T_{s_{L-1} s_L}$ for any $s_1, \cdots, s_L \in \Sigma$
and $L \geq 1$. The other process $\mathbf{X} = \{X_1, X_2, \cdots\}$ with
the alphabet $A$ is defined by the joint probability distributions
$\Pr\{X_1 = x_1, X_2 = x_2, \cdots, X_L = x_L\} :=
\sum_{s, s'} \mu(s) \left(T^{(x_1)} \cdots T^{(x_L)}\right)_{ss'}$ for any $x_1, \cdots, x_L \in A$ and
$L \geq 1$, and is called the output process. The stationarity of the
probability distribution $\mu$ ensures that of both the internal and
output processes.
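
The word probabilities of the output process can be evaluated as a row vector times a product of matrices. The sketch below uses a hypothetical two-state HMM over the alphabet $\{1, 2\}$; the matrices `T`, the distribution `MU` and all function names are assumptions chosen for illustration, and plain Python lists are used instead of a matrix library.

```python
from itertools import product

# Hypothetical edge-emitting HMM: T[a][s][t] is the probability of moving
# from state s to state t while emitting symbol a. The rows of T[1] + T[2]
# sum to 1, and MU is stationary for T[1] + T[2].
T = {
    1: [[0.5, 0.0], [0.25, 0.25]],
    2: [[0.0, 0.5], [0.25, 0.25]],
}
MU = [0.5, 0.5]


def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[s] * M[s][t] for s in range(len(v))) for t in range(len(M[0]))]


def word_probability(word, T, mu):
    """Pr{X_1 = x_1, ..., X_L = x_L} = sum_{s,s'} mu(s) (T^(x_1)...T^(x_L))_{ss'}."""
    v = list(mu)
    for x in word:
        v = vec_mat(v, T[x])
    return sum(v)


total = sum(word_probability(w, T, MU) for w in product((1, 2), repeat=3))
```

Summing over all words of a fixed length gives 1, and marginalising the first symbol of length-2 words reproduces the length-1 probabilities, which is the stationarity stated in the text.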

Symbols $a \in A$ such that $T^{(a)} = O$ occur in the output
process with probability 0. Hence, we obtain the same output
process even if we remove these symbols. Thus, we can
assume $T^{(a)} \neq O$ for any $a \in A$ without loss of generality.

The internal process $\mathbf{S}$ of a HMM $(\Sigma, A, \{T^{(a)}\}_{a \in A}, \mu)$ is
called ergodic if the state transition matrix $T$ is irreducible
[20]: for any $s, s' \in \Sigma$ there exists $k > 0$ such that
$(T^k)_{ss'} > 0$. If the internal process $\mathbf{S}$ is ergodic, then the
stationary distribution $\mu$ is uniquely determined by the state
transition matrix $T$ via the condition (iii'). It is known that the
ergodicity of the internal process $\mathbf{S}$ implies that of the output
process $\mathbf{X}$, but not vice versa [21].

Note that there are two types of hidden Markov models 
depending on whether outputs are emitted from edges or states. 
The HMM defined here is edge emitting type. However, it is 
known that these two classes of HMM are equivalent [19]. In 
particular, any finite-alphabet finite-order stationary Markov 
process can be described as a HMM defined here. 

Lemma 3: Let $\mathbf{X}$ be the output process of a HMM
$(\Sigma, A_n, \{T^{(a)}\}_{a \in A_n}, \mu)$, where $A_n = \{1, 2, \cdots, n\}$ is an
ordered alphabet. If the internal process $\mathbf{S}$ of the HMM is
ergodic, then for any $x \in A_n$ there exist $0 < \gamma_x < 1$ and
$C_x > 0$ such that $\beta_{x,\mathbf{X},L} < C_x \gamma_x^L$ for any $L \geq 1$.

Proof: Given $L \geq 1$, let us put $N := \lfloor L/2 \rfloor$. Fix any
$x \in A_n$. Since we have

$$\beta_{x,\mathbf{X},L} = \sum_{x_j \neq x,\ 1 \leq j \leq N} p(x_1 \cdots x_N)
= \sum_{x_j \neq x,\ 1 \leq j \leq N} \sum_{s, s'} \mu(s) \left(T^{(x_1)} \cdots T^{(x_N)}\right)_{ss'}
= \left\langle \mu \left(T - T^{(x)}\right)^N, \mathbf{1} \right\rangle,$$

where $\mathbf{1} = (1, 1, \cdots, 1)$ and $\langle \cdot, \cdot \rangle$ is the usual inner
product of the $|\Sigma|$-dimensional Euclidean space $\mathbb{R}^{|\Sigma|}$, it is
sufficient to show that the largest eigenvalue of the matrix
$T_{(x)} := T - T^{(x)}$ is less than 1. To prove this we shall appeal to
the Perron-Frobenius theorem because $T_{(x)}$ is a non-negative
matrix:

(i) there exists a non-negative eigenvalue $\lambda$ called the
Perron-Frobenius eigenvalue such that any other eigenvalue
of $T_{(x)}$ has absolute value not greater than $\lambda$,

(ii) $\lambda \leq \max_{s} \{\sum_{s'} (T_{(x)})_{ss'}\} \leq 1$,

(iii) and there exists a non-negative left eigenvector $v$ corresponding
to the eigenvalue $\lambda$.

We can show that for any $\epsilon > 0$ there exists $C_\epsilon > 0$ such
that for any $k \geq 1$

$$\|\mu T_{(x)}^k\| \leq C_\epsilon (\lambda + \epsilon)^k \|\mu\|,$$

where $\| \cdot \|$ is the Euclidean norm and we used the fact
that any non-negative matrix and its transpose have the same
Perron-Frobenius eigenvalue. For the proof of this inequality,
see the beginning of section 1.2 in [22], for example. If $\lambda < 1$
then we can choose $\epsilon > 0$ so that $\lambda + \epsilon < 1$. If we put
$\gamma_x := (\lambda + \epsilon)^{1/2}$ and $C_x := C_\epsilon (\lambda + \epsilon)^{-1} \|\mu\| \|\mathbf{1}\|$ then we
obtain $\beta_{x,\mathbf{X},L} < C_x \gamma_x^L$ by the Cauchy-Schwarz inequality as
desired.

Let us derive a contradiction from the assumption $\lambda = 1$.
If $\lambda = 1$ then we have $v T_{(x)} = v$. For any $k \geq 1$, we have

$$\langle v, \mathbf{1} \rangle = \langle v T_{(x)}^k, \mathbf{1} \rangle \leq \langle v T^k, \mathbf{1} \rangle = \langle v, T^k \mathbf{1} \rangle = \langle v, \mathbf{1} \rangle,$$

because $T_{(x)} \leq T$ and $T$ is a stochastic matrix. Thus, we
obtain $\langle v (T^k - T_{(x)}^k), \mathbf{1} \rangle = 0$. Since $\mathbf{1}$ is a positive vector
and $v (T^k - T_{(x)}^k)$ is a non-negative vector, it follows that

$$v \left(T^k - T_{(x)}^k\right) = 0.$$

Let us consider $u, u' \in \Sigma$ such that $T^{(x)}_{uu'} > 0$. For any
$s, s' \in \Sigma$, there exist $k_1, k_2 \geq 1$ such that $(T^{k_1})_{su} > 0$ and
$(T^{k_2})_{u's'} > 0$ because $T$ is irreducible. If we put $k = k_1 +
k_2 + 1$ then it holds that

$$\left(T^k - T_{(x)}^k\right)_{ss'} = \sum_{x_1, \cdots, x_k,\ \exists i \text{ s.t. } x_i = x} \left(T^{(x_1)} \cdots T^{(x_k)}\right)_{ss'}
\geq (T^{k_1})_{su}\, T^{(x)}_{uu'}\, (T^{k_2})_{u's'} > 0.$$

On the other hand, the $s'$-th component of the vector
$v (T^k - T_{(x)}^k)$ must be 0:

$$\sum_{s''} v_{s''} \left(T^k - T_{(x)}^k\right)_{s''s'} = 0,$$

where $v_{s''}$ denotes the $s''$-th component of $v$. We obtain $v_s = 0$
because $v$ is a non-negative vector and $T^k - T_{(x)}^k$ is a
non-negative matrix. Since $s \in \Sigma$ is arbitrary, we conclude
that $v = 0$. However, this contradicts the fact that $v$ is an
eigenvector. ■
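
The matrix expression $\beta_{x,\mathbf{X},L} = \langle \mu (T - T^{(x)})^N, \mathbf{1} \rangle$ used in the proof can be evaluated directly. The sketch below does so for a hypothetical two-state HMM (the matrices `T` and the distribution `MU` are illustrative assumptions) and exhibits the geometric decay claimed in Lemma 3.

```python
# Hypothetical HMM over the alphabet {1, 2}; T[a][s][t] is the probability of
# going from state s to state t while emitting a. MU is stationary.
T = {
    1: [[0.5, 0.0], [0.25, 0.25]],
    2: [[0.0, 0.5], [0.25, 0.25]],
}
MU = [0.5, 0.5]


def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[s] * M[s][t] for s in range(len(v))) for t in range(len(M[0]))]


def beta(x, L, T, mu):
    """beta_{x,X,L} = <mu (T - T^(x))^N, 1> with N = floor(L / 2)."""
    size = len(mu)
    # T_(x) = T - T^(x) is the sum of the matrices of all symbols other than x
    t_x = [
        [sum(T[a][s][t] for a in T if a != x) for t in range(size)]
        for s in range(size)
    ]
    v = list(mu)
    for _ in range(L // 2):
        v = vec_mat(v, t_x)
    return sum(v)


decay = [beta(1, L, T, MU) for L in (2, 4, 6, 8)]
```

In this particular example every row of $T_{(1)}$ sums to $1/2$, so $\beta_{1,\mathbf{X},L} = 2^{-\lfloor L/2 \rfloor}$ and the Perron-Frobenius eigenvalue is $1/2 < 1$.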
IV. Permutation complexity and coupling measures

In this section, we discuss the equalities between complexity 
and coupling measures and their permutation versions for 
the output processes of HMMs whose internal processes are 
ergodic. 

A. Fundamental lemma 

Let $(\mathbf{X}^1, \cdots, \mathbf{X}^m)$ be a multivariate finite-alphabet stationary
stochastic process, where each univariate process $\mathbf{X}^k =
\{X^k_1, X^k_2, \cdots\}$, $k = 1, 2, \cdots, m$, is defined over an ordered
alphabet $A_{n_k}$. For simplicity, we use the notations

$$p(x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m}) := \Pr\{X^1_{a_1:b_1} = x^1_{a_1:b_1}, \cdots, X^m_{a_m:b_m} = x^m_{a_m:b_m}\},$$

$$p(\pi_1, \cdots, \pi_m) := \Pr\{\phi_{n_k, b_k - a_k + 1} \circ X^k_{a_k:b_k} = \pi_k,\ k = 1, \cdots, m\}$$

and

$$p(\pi_k) := \Pr\{\phi_{n_k, b_k - a_k + 1} \circ X^k_{a_k:b_k} = \pi_k\},$$

where $1 \leq a_k \leq b_k$, $x^k_{a_k:b_k} \in A_{n_k}^{b_k - a_k + 1}$ and $\pi_k \in S_{b_k - a_k + 1}$
for $k = 1, \cdots, m$.
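Lemma 4:

$$0 \leq H(X^1_{a_1:b_1}, \cdots, X^m_{a_m:b_m}) - H^*(X^1_{a_1:b_1}, \cdots, X^m_{a_m:b_m})
\leq \left( \sum_{k=1}^{m} \alpha_{\mathbf{X}^k, b_k - a_k + 1} \right) \left( \sum_{k=1}^{m} n_k \log(b_k - a_k + 1 + n_k) \right),$$

where

$$H(X^1_{a_1:b_1}, \cdots, X^m_{a_m:b_m}) = - \sum_{x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m}} p(x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m}) \log p(x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m})$$

and

$$H^*(X^1_{a_1:b_1}, \cdots, X^m_{a_m:b_m}) = - \sum_{\pi_1, \cdots, \pi_m} p(\pi_1, \cdots, \pi_m) \log p(\pi_1, \cdots, \pi_m)$$

are the Shannon entropies of the joint occurrence of words
$x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m}$ and permutations $\pi_1, \cdots, \pi_m$, respectively,
and the base of the logarithm is taken as 2.

Proof: We have

$$H(X^1_{a_1:b_1}, \cdots, X^m_{a_m:b_m}) - H^*(X^1_{a_1:b_1}, \cdots, X^m_{a_m:b_m})
= \sum_{\substack{\pi_1, \cdots, \pi_m, \\ p(\pi_1, \cdots, \pi_m) > 0}} p(\pi_1, \cdots, \pi_m)
\left( - \sum_{\substack{x^k_{a_k:b_k} \in \phi^{-1}_{n_k, b_k - a_k + 1}(\pi_k), \\ 1 \leq k \leq m}}
\frac{p(x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m})}{p(\pi_1, \cdots, \pi_m)}
\log \frac{p(x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m})}{p(\pi_1, \cdots, \pi_m)} \right).$$

By Theorem 1 (i), it holds that

$$0 \leq - \sum_{\substack{x^k_{a_k:b_k} \in \phi^{-1}_{n_k, b_k - a_k + 1}(\pi_k), \\ 1 \leq k \leq m}}
\frac{p(x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m})}{p(\pi_1, \cdots, \pi_m)}
\log \frac{p(x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m})}{p(\pi_1, \cdots, \pi_m)}
\leq \log \prod_{k=1}^{m} \binom{b_k - a_k + n_k - \mathrm{Desc}(\pi_k)}{b_k - a_k + 1}
\leq \log \prod_{k=1}^{m} (b_k - a_k + 1 + n_k)^{n_k}
= \sum_{k=1}^{m} n_k \log(b_k - a_k + 1 + n_k)$$

for $(\pi_1, \cdots, \pi_m) \in S_{b_1 - a_1 + 1} \times \cdots \times S_{b_m - a_m + 1}$ such that
$p(\pi_1, \cdots, \pi_m) > 0$.

If $|(\phi_{n_1, b_1 - a_1 + 1} \times \cdots \times \phi_{n_m, b_m - a_m + 1})^{-1}(\pi_1, \cdots, \pi_m)| = 1$
then

$$- \sum_{\substack{x^k_{a_k:b_k} \in \phi^{-1}_{n_k, b_k - a_k + 1}(\pi_k), \\ 1 \leq k \leq m}}
\frac{p(x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m})}{p(\pi_1, \cdots, \pi_m)}
\log \frac{p(x^1_{a_1:b_1}, \cdots, x^m_{a_m:b_m})}{p(\pi_1, \cdots, \pi_m)} = 0.$$

On the other hand, we have

$$\sum_{\substack{\pi_1, \cdots, \pi_m, \\ \exists k \text{ s.t. } |\phi^{-1}_{n_k, b_k - a_k + 1}(\pi_k)| > 1}} p(\pi_1, \cdots, \pi_m)
\leq \sum_{k=1}^{m} \sum_{\substack{\pi_k, \\ |\phi^{-1}_{n_k, b_k - a_k + 1}(\pi_k)| > 1}} p(\pi_k)
= \sum_{k=1}^{m} \alpha_{\mathbf{X}^k, b_k - a_k + 1}.$$

This completes the proof of the inequality. ■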
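
For a univariate process ($m = 1$, $a_1 = 1$, $b_1 = L$) the inequality of Lemma 4 reads $0 \leq H(X_{1:L}) - H^*(X_{1:L}) \leq \alpha_{\mathbf{X},L}\, n \log(L + n)$. The sketch below checks this numerically by exhaustive enumeration for an i.i.d. process, which is an assumption made here only to keep the example short; the function names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2, prod


def permutation_type(word):
    """Permutation type of a word as a tuple of 1-based indices."""
    order = sorted(range(len(word)), key=lambda i: word[i])
    return tuple(i + 1 for i in order)


def lemma4_gap_and_bound(probs, L):
    """Return (H - H*, alpha * n * log2(L + n)) for an i.i.d. process."""
    n = len(probs)
    mass = defaultdict(float)  # p(pi)
    size = defaultdict(int)    # |phi^{-1}(pi)|
    entropy_words = 0.0
    for w in product(range(1, n + 1), repeat=L):
        p = prod(probs[x - 1] for x in w)
        if p > 0:
            entropy_words -= p * log2(p)
        pi = permutation_type(w)
        mass[pi] += p
        size[pi] += 1
    entropy_perms = -sum(m * log2(m) for m in mass.values() if m > 0)
    alpha = sum(m for pi, m in mass.items() if size[pi] > 1)
    return entropy_words - entropy_perms, alpha * n * log2(L + n)


gap, bound = lemma4_gap_and_bound([0.5, 0.5], 6)
```

For the fair binary process with $L = 6$ the gap is $(7/64) \log_2 7$ and the bound is $(7/64) \cdot 2 \cdot \log_2 8$, so the inequality holds with room to spare.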

B. Excess Entropy 

Let X be a finite-alphabet stationary stochastic process. Its 
excess entropy is defined by [10]

$$E(\mathbf{X}) = \lim_{L \to \infty} \left[ H(X_{1:L}) - h(\mathbf{X}) L \right]
= \sum_{L=1}^{\infty} \left[ H(X_L | X_{1:L-1}) - h(\mathbf{X}) \right],$$

if the limit on the right-hand side exists, where $h(\mathbf{X}) =
\lim_{L \to \infty} H(X_{1:L})/L$ is the entropy rate of $\mathbf{X}$, which exists
for any finite-alphabet stationary stochastic process [23].

The excess entropy has been used as a measure of com- 
plexity Gl,|2a,123,||23,||2a,|29). Actually, it quantifies 
global correlations present in a given stationary process in the 
following sense: if E(X) exists then it can be written as the 
mutual information between the past and future 

$$E(\mathbf{X}) = \lim_{L \to \infty} I(X_{1:L}; X_{L+1:2L}).$$

It is known that if $\mathbf{X}$ is the output process of a HMM then
E(X) exists ED. 

When the alphabet of $\mathbf{X}$ is an ordered alphabet $A_n$, we
define the permutation excess entropy of $\mathbf{X}$ [10] by

$$E^*(\mathbf{X}) = \lim_{L \to \infty} \left[ H^*(X_{1:L}) - h^*(\mathbf{X}) L \right]
= \sum_{L=1}^{\infty} \left[ H^*(X_L | X_{1:L-1}) - h^*(\mathbf{X}) \right],$$

if the limit on the right-hand side exists, where $h^*(\mathbf{X}) =
\lim_{L \to \infty} H^*(X_{1:L})/L$ is the permutation entropy rate of $\mathbf{X}$
which exists for any finite-alphabet stationary stochastic pro- 
cess and is equal to the entropy rate h(X) EL UH> and 
$H^*(X_L | X_{1:L-1}) := H^*(X_{1:L}) - H^*(X_{1:L-1})$.

The following proposition is a generalization of our previous
results in [10], [11].

Proposition 5: Let $\mathbf{X}$ be the output process of a HMM
$(\Sigma, A_n, \{T^{(a)}\}_{a \in A_n}, \mu)$ with an ergodic internal process.
Then, we have

$$E(\mathbf{X}) = E^*(\mathbf{X}) = \lim_{L \to \infty} I^*(X_{1:L}; X_{L+1:2L}),$$

where $I^*(X_{1:L}; X_{L+1:2L}) := H^*(X_{1:L}) + H^*(X_{L+1:2L}) -
H^*(X_{1:L}, X_{L+1:2L}) = 2H^*(X_{1:L}) - H^*(X_{1:L}, X_{L+1:2L})$.

Proof: Let $L \geq 1$. We have

$$\left| \left[ H(X_{1:L}) - h(\mathbf{X}) L \right] - \left[ H^*(X_{1:L}) - h^*(\mathbf{X}) L \right] \right|
= |H(X_{1:L}) - H^*(X_{1:L})|
\leq \alpha_{\mathbf{X},L}\, n \log(L + n)
\leq 2 C n^2 \log(L + n) \gamma^L,$$

where $C := \max_{x \in A_n} \{C_x\}$, $\gamma := \max_{x \in A_n} \{\gamma_x\} < 1$ and
we have used $h(\mathbf{X}) = h^*(\mathbf{X})$ for the first equality, Lemma
4 for the first inequality and Lemma 2 and Lemma 3 for
the last inequality. By taking the limit $L \to \infty$ we obtain
$E(\mathbf{X}) = E^*(\mathbf{X})$.
To prove

$$\lim_{L \to \infty} I(X_{1:L}; X_{L+1:2L}) = \lim_{L \to \infty} I^*(X_{1:L}; X_{L+1:2L}),$$

it is sufficient to show that

$$|H(X_{1:L}, X_{L+1:2L}) - H^*(X_{1:L}, X_{L+1:2L})| \to 0$$

as $L \to \infty$. This is because we have

$$|I(X_{1:L}; X_{L+1:2L}) - I^*(X_{1:L}; X_{L+1:2L})|
\leq 2 |H(X_{1:L}) - H^*(X_{1:L})|
+ |H(X_{1:L}, X_{L+1:2L}) - H^*(X_{1:L}, X_{L+1:2L})|.$$

However, this can be shown in the same way as in the above
discussion by applying Lemma 4 to the bivariate process
$(\mathbf{X}^1, \mathbf{X}^2) := (\mathbf{X}, \mathbf{X})$ and then using Lemma 2 and Lemma
3. ■

C. Transfer Entropy and Momentary Information Transfer 

In this subsection we consider two information rates that 
are measures of coupling direction and strength between two 
jointly distributed processes and discuss the equalities between 
them and their permutation versions. One is the rate version 



of the transfer entropy [14] and the other is the rate version of
the momentary information transfer [16]. Both are particular
instances of conditional mutual information [30].

Let $(\mathbf{X}, \mathbf{Y})$ be a bivariate finite-alphabet stationary stochastic
process. We assume that the alphabets of $\mathbf{X}$ and $\mathbf{Y}$ are
ordered alphabets $A_n$ and $A_m$, respectively. For $\tau = 1, 2, \cdots$,
we define the $\tau$-step transfer entropy rate from $\mathbf{Y}$ to $\mathbf{X}$ by

$$t_\tau(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \left[ H(X_{1:L+\tau}) - H(X_{1:L})
- H(X_{1:L+\tau}, Y_{1:L}) + H(X_{1:L}, Y_{1:L}) \right].$$

When $\tau = 1$, $t_1(\mathbf{Y} \to \mathbf{X})$ is called just the transfer entropy rate
[31] from $\mathbf{Y}$ to $\mathbf{X}$ and simply denoted by $t(\mathbf{Y} \to \mathbf{X})$.
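If we introduce the $\tau$-step entropy rate of $\mathbf{X}$ by

$$h_\tau(\mathbf{X}) = \lim_{L \to \infty} H(X_{L+1:L+\tau} | X_{1:L})$$

and the $\tau$-step conditional entropy rate of $\mathbf{X}$ given $\mathbf{Y}$ by

$$h_\tau(\mathbf{X} | \mathbf{Y}) = \lim_{L \to \infty} H(X_{L+1:L+\tau} | X_{1:L}, Y_{1:L})$$

then we can write

$$t_\tau(\mathbf{Y} \to \mathbf{X}) = h_\tau(\mathbf{X}) - h_\tau(\mathbf{X} | \mathbf{Y})$$

because both $h_\tau(\mathbf{X})$ and $h_\tau(\mathbf{X} | \mathbf{Y})$ exist. We call $h_1(\mathbf{X} | \mathbf{Y})$ the
conditional entropy rate and denote it by $h(\mathbf{X} | \mathbf{Y})$.$^1$
$h_\tau(\mathbf{X})$ is additive, namely, we always have

$$h_\tau(\mathbf{X}) = \tau h_1(\mathbf{X}) = \tau h(\mathbf{X}).$$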

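However, for the $\tau$-step conditional entropy rate, the additivity
cannot hold in general. It is at most super-additive: we only
have the inequality

$$h_\tau(\mathbf{X} | \mathbf{Y}) \geq \tau h(\mathbf{X} | \mathbf{Y})$$

in general. Indeed, we have

$$h_\tau(\mathbf{X} | \mathbf{Y}) = \lim_{L \to \infty} H(X_{L+1:L+\tau} | X_{1:L}, Y_{1:L})
= \lim_{L \to \infty} \sum_{\tau'=1}^{\tau} H(X_{L+\tau'} | X_{1:L+\tau'-1}, Y_{1:L})
\geq \lim_{L \to \infty} \sum_{\tau'=1}^{\tau} H(X_{L+\tau'} | X_{1:L+\tau'-1}, Y_{1:L+\tau'-1})
= \tau h(\mathbf{X} | \mathbf{Y}).$$

This leads to the sub-additivity of the $\tau$-step transfer entropy
rate:

$$t_\tau(\mathbf{Y} \to \mathbf{X}) \leq \tau\, t(\mathbf{Y} \to \mathbf{X}).$$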

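An example with the strict inequality can be easily given.
Let $\mathbf{Y}$ be an i.i.d. process and $\mathbf{X}$ be defined by $X_1 = Y_1$
and $X_{i+1} = Y_i$. We have $h(\mathbf{X}) = h(\mathbf{Y}) = H(Y_1)$ and
$h_\tau(\mathbf{X} | \mathbf{Y}) = (\tau - 1) H(Y_1)$. Hence, $t_\tau(\mathbf{Y} \to \mathbf{X}) = H(Y_1)$
for any $\tau = 1, 2, \cdots$, which is strictly smaller than $\tau\, t(\mathbf{Y} \to \mathbf{X}) = \tau H(Y_1)$
for $\tau \geq 2$ whenever $H(Y_1) > 0$.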
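
The example can be checked by exhaustive enumeration at finite $L$. The sketch below assumes that $\mathbf{Y}$ is a fair binary i.i.d. process, so $H(Y_1) = 1$ bit, and evaluates the bracket in the definition of $t_\tau(\mathbf{Y} \to \mathbf{X})$ for one fixed $L$ rather than taking the limit; the function names are illustrative.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(counts, total):
    """Shannon entropy (bits) of the empirical distribution counts / total."""
    return -sum(c / total * log2(c / total) for c in counts.values())


def transfer_entropy_bracket(L, tau):
    """H(X_{1:L+tau}) - H(X_{1:L}) - H(X_{1:L+tau}, Y_{1:L}) + H(X_{1:L}, Y_{1:L})
    for fair i.i.d. bits Y and X_1 = Y_1, X_{i+1} = Y_i."""
    x_long, x_short = Counter(), Counter()
    xy_long, xy_short = Counter(), Counter()
    outcomes = list(product((0, 1), repeat=L + tau))  # equally likely Y words
    for y in outcomes:
        x = (y[0],) + y[: L + tau - 1]  # delayed copy of Y, length L + tau
        x_long[x] += 1
        x_short[x[:L]] += 1
        xy_long[(x, y[:L])] += 1
        xy_short[(x[:L], y[:L])] += 1
    total = len(outcomes)
    return (
        entropy(x_long, total)
        - entropy(x_short, total)
        - entropy(xy_long, total)
        + entropy(xy_short, total)
    )


values = [transfer_entropy_bracket(3, tau) for tau in (1, 2, 3)]
```

Each value equals 1 bit, independent of $\tau$, in agreement with $t_\tau(\mathbf{Y} \to \mathbf{X}) = H(Y_1)$.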

$^1$ Note that the conditional entropy rate here is slightly different from that
found in the literature. For example, in [32], the conditional entropy rate (called
conditional uncertainty) is defined by $\lim_{L \to \infty} H(X_{L+1} | X_{1:L}, Y_{1:L+1})$.
The difference from the conditional entropy rate defined here is in whether
the conditioning on $Y_{L+1}$ is involved or not.

There are two permutation versions of the transfer entropy.
One is called symbolic transfer entropy (STE) [33] and the
other is called transfer entropy on rank vector (TERV) [34].
Here, we introduce their rate versions as follows: the rate of
STE from $\mathbf{Y}$ to $\mathbf{X}$ is defined by

$$t^{**}_\tau(\mathbf{Y} \to \mathbf{X})
= \lim_{L \to \infty} \left[ H^*(X_{1:L}, X_{1+\tau:L+\tau}) - H^*(X_{1:L})
- H^*(X_{1:L}, X_{1+\tau:L+\tau}, Y_{1:L}) + H^*(X_{1:L}, Y_{1:L}) \right]$$

if the limit on the right-hand side exists. The rate of TERV
from $\mathbf{Y}$ to $\mathbf{X}$ is defined by

$$t^{*}_\tau(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \left[ H^*(X_{1:L+\tau}) - H^*(X_{1:L})
- H^*(X_{1:L+\tau}, Y_{1:L}) + H^*(X_{1:L}, Y_{1:L}) \right]$$

if the limit on the right-hand side exists. If $E^*(\mathbf{X})$ exists then,
by the definition of the permutation excess entropy, we have

$$h^*(\mathbf{X}) = \lim_{L \to \infty} \left[ H^*(X_{1:L+1}) - H^*(X_{1:L}) \right].$$

In this case, $t^*_1(\mathbf{Y} \to \mathbf{X})$ coincides with a quantity called
symbolic transfer entropy rate introduced in [12].
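
The finite-$L$ bracket in the TERV rate only needs joint entropies of permutation types. As a minimal sanity check, the sketch below evaluates it for two independent uniform i.i.d. processes (an assumption chosen because the answer is then known to be exactly zero: the permutation types of independent words are independent). All names are illustrative.

```python
from collections import Counter
from itertools import product
from math import log2


def permutation_type(word):
    """Permutation type of a word as a tuple of 1-based indices."""
    order = sorted(range(len(word)), key=lambda i: word[i])
    return tuple(i + 1 for i in order)


def entropy(counts):
    """Shannon entropy (bits) of the distribution proportional to counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def terv_bracket_independent(L, tau, n=2, m=2):
    """H*(X_{1:L+tau}) - H*(X_{1:L}) - H*(X_{1:L+tau}, Y_{1:L}) + H*(X_{1:L}, Y_{1:L})
    for independent uniform i.i.d. X over {1..n} and Y over {1..m}."""
    x_long, x_short = Counter(), Counter()
    xy_long, xy_short = Counter(), Counter()
    for x in product(range(1, n + 1), repeat=L + tau):
        pi_long = permutation_type(x)
        pi_short = permutation_type(x[:L])
        for y in product(range(1, m + 1), repeat=L):
            pi_y = permutation_type(y)
            # every (x, y) pair is equally likely, so counting suffices
            x_long[pi_long] += 1
            x_short[pi_short] += 1
            xy_long[(pi_long, pi_y)] += 1
            xy_short[(pi_short, pi_y)] += 1
    return entropy(x_long) - entropy(x_short) - entropy(xy_long) + entropy(xy_short)


value = terv_bracket_independent(3, 1)
```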

Proposition 6: Let $(\mathbf{X}, \mathbf{Y})$ be the output process of a HMM
$(\Sigma, A_n \times A_m, \{T^{(a,b)}\}_{(a,b) \in A_n \times A_m}, \mu)$ with an ergodic internal
process. Then, we have

$$t_\tau(\mathbf{Y} \to \mathbf{X}) = t^*_\tau(\mathbf{Y} \to \mathbf{X}) = t^{**}_\tau(\mathbf{Y} \to \mathbf{X}).$$

Proof: Since both $\mathbf{X}$ and $\mathbf{Y}$ are the output processes
of appropriate HMMs with ergodic internal processes, the
equalities follow from a discussion similar to that in
the proof of Proposition 5. Indeed, for example, $\mathbf{X}$ is the
output process of the HMM $(\Sigma, A_n, \{T^{(a)}\}_{a \in A_n}, \mu)$ where
$T^{(a)} := \sum_{b \in A_m} T^{(a,b)}$. ■
A different instance of conditional mutual information 

called momentary information transfer is considered in [16]. It
was proposed to improve the ability to detect coupling delays,
which the transfer entropy lacks. Here, we consider
its rate version: the momentary information transfer rate is
defined by

$$m_\tau(\mathbf{Y} \to \mathbf{X})
= \lim_{L \to \infty} \left[ H(X_{1:L+\tau}, Y_{1:L-1}) - H(X_{1:L+\tau-1}, Y_{1:L-1})
- H(X_{1:L+\tau}, Y_{1:L}) + H(X_{1:L+\tau-1}, Y_{1:L}) \right].$$

Its permutation version, called the momentary sorting information
transfer rate, is defined by

$$m^*_\tau(\mathbf{Y} \to \mathbf{X})
= \lim_{L \to \infty} \left[ H^*(X_{1:L+\tau}, Y_{1:L-1}) - H^*(X_{1:L+\tau-1}, Y_{1:L-1})
- H^*(X_{1:L+\tau}, Y_{1:L}) + H^*(X_{1:L+\tau-1}, Y_{1:L}) \right].$$

By a discussion similar to that in the proof of Proposition
6, we obtain the following equality:

Proposition 7: Let $(\mathbf{X}, \mathbf{Y})$ be the output process of a HMM
$(\Sigma, A_n \times A_m, \{T^{(a,b)}\}_{(a,b) \in A_n \times A_m}, \mu)$ with an ergodic internal
process. Then, we have

$$m_\tau(\mathbf{Y} \to \mathbf{X}) = m^*_\tau(\mathbf{Y} \to \mathbf{X}).$$



D. Directed Information 

Directed information is a measure of coupling direction
and strength based on the idea of causal conditioning
[18], [35]. Since it is not a particular instance of conditional
mutual information, here we treat it separately. In the following
presentation, we make use of terminologies from [31], [36].

Let $(\mathbf{X}, \mathbf{Y})$ be a bivariate finite-alphabet stationary stochastic
process. The alphabets of $\mathbf{X}$ and $\mathbf{Y}$ are ordered alphabets
$A_n$ and $A_m$, respectively. The directed information rate from
$\mathbf{Y}$ to $\mathbf{X}$ is defined by

$$I_\infty(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \frac{1}{L} I(Y_{1:L} \to X_{1:L})$$

where

$$I(Y_{1:L} \to X_{1:L}) = \sum_{i=1}^{L} I(X_i; Y_{1:i} | X_{1:i-1})
= H(X_{1:L}) - \sum_{i=1}^{L} H(X_i | X_{1:i-1}, Y_{1:i}).$$

Note that if $Y_{1:i}$ in the above expression on the right-hand
side is replaced by $Y_{1:L}$ then we obtain the mutual information
between $X_{1:L}$ and $Y_{1:L}$:

$$I(X_{1:L}; Y_{1:L}) = H(X_{1:L}) - \sum_{i=1}^{L} H(X_i | X_{1:i-1}, Y_{1:L}).$$

Thus, conditioning on $Y_{1:i}$ for $i = 1, \cdots, L$, not on $Y_{1:L}$,
distinguishes the directed information from the mutual information.
Following [35], we write

$$H(X_{1:L} \| Y_{1:L}) := \sum_{i=1}^{L} H(X_i | X_{1:i-1}, Y_{1:i})$$

and call the quantity causal conditional entropy. By using this
notation, we have

$$I(Y_{1:L} \to X_{1:L}) = H(X_{1:L}) - H(X_{1:L} \| Y_{1:L}).$$
The permutation version of the directed information rate,
which we call the symbolic directed information rate, is defined
by

$$I^*_\infty(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} \frac{1}{L} I^*(Y_{1:L} \to X_{1:L})$$

if the limit on the right-hand side exists, where

$$I^*(Y_{1:L} \to X_{1:L})
:= H^*(X_{1:L}) - \sum_{i=1}^{L} \left[ H^*(X_{1:i}, Y_{1:i}) - H^*(X_{1:i-1}, Y_{1:i}) \right].$$

If we write

$$I^*(X_i; Y_{1:i} | X_{1:i-1}) := H^*(X_{1:i}) - H^*(X_{1:i-1})
- H^*(X_{1:i}, Y_{1:i}) + H^*(X_{1:i-1}, Y_{1:i})$$

and

$$H^*(X_{1:L} \| Y_{1:L}) := \sum_{i=1}^{L} \left[ H^*(X_{1:i}, Y_{1:i}) - H^*(X_{1:i-1}, Y_{1:i}) \right]$$

then we have the expressions

$$I^*(Y_{1:L} \to X_{1:L}) = \sum_{i=1}^{L} I^*(X_i; Y_{1:i} | X_{1:i-1})
= H^*(X_{1:L}) - H^*(X_{1:L} \| Y_{1:L}).$$
Proposition 8: Let $(\mathbf{X}, \mathbf{Y})$ be the output process of a HMM
$(\Sigma, A_n \times A_m, \{T^{(a,b)}\}_{(a,b) \in A_n \times A_m}, \mu)$ with an ergodic internal
process. Then, we have

$$I_\infty(\mathbf{Y} \to \mathbf{X}) = I^*_\infty(\mathbf{Y} \to \mathbf{X}).$$

Proof: We have

$$|I(Y_{1:L} \to X_{1:L}) - I^*(Y_{1:L} \to X_{1:L})|
\leq |H(X_{1:L}) - H^*(X_{1:L})|
+ \sum_{i=1}^{L} |H(X_{1:i}, Y_{1:i}) - H^*(X_{1:i}, Y_{1:i})|
+ \sum_{i=1}^{L} |H(X_{1:i-1}, Y_{1:i}) - H^*(X_{1:i-1}, Y_{1:i})|.$$

We know that the first term on the right-hand side in the above
inequality goes to 0 as $L \to \infty$. Let us evaluate the second
sum. By Lemma 4, it holds that

$$\sum_{i=1}^{L} |H(X_{1:i}, Y_{1:i}) - H^*(X_{1:i}, Y_{1:i})|
\leq \sum_{i=1}^{L} (\alpha_{\mathbf{X},i} + \alpha_{\mathbf{Y},i}) \left[ n \log(i + n) + m \log(i + m) \right].$$

By Lemma 2 and Lemma 3, we have

$$\sum_{i=1}^{L} \alpha_{\mathbf{X},i}\, n \log(i + n) \leq 2 C n^2 \sum_{i=1}^{L} \gamma^i \log(i + n),$$

where $C := \max_{x \in A_n} \{C_x\}$ and $\gamma := \max_{x \in A_n} \{\gamma_x\} < 1$.
It is elementary to show that $\lim_{L \to \infty} \sum_{i=1}^{L} \gamma^i \log(i + n)$ is
finite. The limits of the other terms are also shown to be finite
similarly. Thus, we can conclude that the limit of the second
sum is bounded. Similarly, the limit of the third sum is also
bounded. Dividing by $L$ and taking the limit $L \to \infty$, the equality
in the claim follows immediately. ■

For output processes of HMMs with ergodic internal pro- 
cesses, properties on the directed information rate can be 
transferred to those on the symbolic directed information rate. 
Since their proofs can be given in the same manner as
those of the above propositions, here we list some of them
without proofs. For the proofs of the properties on the directed 
information rate, we refer to [35], [36]. 

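Let $(\mathbf{X}, \mathbf{Y})$ be the output process of a HMM $(\Sigma, A_n \times
A_m, \{T^{(a,b)}\}_{(a,b) \in A_n \times A_m}, \mu)$ with an ergodic internal process.
Then, we have

(i)

$$I^*_\infty(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} I^*(X_L; Y_{1:L} | X_{1:L-1}).$$

This is the permutation version of the equality

$$I_\infty(\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} I(X_L; Y_{1:L} | X_{1:L-1}).$$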



(ii)

$$I_\infty(D\mathbf{Y} \to \mathbf{X}) = I^*_\infty(D\mathbf{Y} \to \mathbf{X})
= \lim_{L \to \infty} I^*(X_L; Y_{1:L-1} | X_{1:L-1}).$$

Here,

$$I_\infty(D\mathbf{Y} \to \mathbf{X}) := \lim_{L \to \infty} \frac{1}{L} I(DY_{1:L} \to X_{1:L})$$

and

$$I(DY_{1:L} \to X_{1:L}) := \sum_{i=1}^{L} I(X_i; Y_{1:i-1} | X_{1:i-1}).$$

The symbol $D$ denotes the one-step delay. $I^*_\infty(D\mathbf{Y} \to
\mathbf{X})$ is the corresponding permutation version. The second
equality is the permutation version of the equality

$$I_\infty(D\mathbf{Y} \to \mathbf{X}) = \lim_{L \to \infty} I(X_L; Y_{1:L-1} | X_{1:L-1}).$$

Since $I_\infty(D\mathbf{Y} \to \mathbf{X})$ coincides with the transfer entropy
rate, the first equality is just the equality between the
transfer entropy rate and the symbolic transfer entropy
rate (or the rate of 1-step TERV) proved in Proposition
6, given the second equality.

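(iii)

$$I_\infty(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}) = I^*_\infty(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y})
= \lim_{L \to \infty} I^*(X_L; Y_L | X_{1:L-1}, Y_{1:L-1}),$$

where $I_\infty(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y})$ is called the instantaneous information
exchange rate and is defined by

$$I_\infty(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}) := \lim_{L \to \infty} \frac{1}{L} I(Y_{1:L} \to X_{1:L} \| DY_{1:L})$$

and

$$I(Y_{1:L} \to X_{1:L} \| DY_{1:L})
= H(X_{1:L} \| DY_{1:L}) - H(X_{1:L} \| Y_{1:L}, DY_{1:L})
= \sum_{i=1}^{L} I(X_i; Y_{1:i} | X_{1:i-1}, Y_{1:i-1})
= \sum_{i=1}^{L} I(X_i; Y_i | X_{1:i-1}, Y_{1:i-1}).$$

From the last expression of $I(Y_{1:L} \to X_{1:L} \| DY_{1:L})$, we
can obtain

$$I_\infty(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}) = \lim_{L \to \infty} I(X_L; Y_L | X_{1:L-1}, Y_{1:L-1}).$$

$I^*_\infty(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y})$ is the corresponding permutation
version and is called the symbolic instantaneous information
exchange rate.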

(iv)

$$I^*_\infty(\mathbf{Y} \to \mathbf{X}) = I^*_\infty(D\mathbf{Y} \to \mathbf{X}) + I^*_\infty(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}).$$

Namely, the symbolic directed information rate decomposes
into the sum of the symbolic transfer entropy rate
and the symbolic instantaneous information exchange
rate. This follows immediately from (ii), (iii) and the
equality saying that the directed information rate decomposes
into the sum of the transfer entropy rate and the
instantaneous information exchange rate:

$$I_\infty(\mathbf{Y} \to \mathbf{X}) = I_\infty(D\mathbf{Y} \to \mathbf{X}) + I_\infty(\mathbf{Y} \to \mathbf{X} \| D\mathbf{Y}).$$

(v)

$$I^*_\infty(Y \to X) + I^*_\infty(DX \to Y) = I^*_\infty(X; Y).$$

This is the permutation version of the equality saying
that the mutual information rate between X and Y is the
sum of the directed information rate from Y to X and
the transfer entropy rate from X to Y:

$$I_\infty(Y \to X) + I_\infty(DX \to Y) = I_\infty(X; Y),$$

where

$$I_\infty(X; Y) := \lim_{L \to \infty} \frac{1}{L} I(X_{1:L}; Y_{1:L})$$

is the mutual information rate and $I^*_\infty(X; Y)$ is its
permutation version, called the symbolic mutual information
rate. It is known that they are equal for any bivariate
finite-alphabet stationary stochastic process [12]. Thus,
the symbolic mutual information rate between X and Y
is the sum of the symbolic directed information rate from
Y to X and the symbolic transfer entropy rate from X
to Y.
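The decomposition of the directed information and its relation to the mutual information already hold at finite length $L$, before dividing by $L$ and taking the limit, by the chain rule for conditional mutual information. The following Python sketch checks both identities numerically for an arbitrary joint distribution of $(X_{1:L}, Y_{1:L})$. It is only an illustration for the non-permutation quantities: the function names, the binary alphabet, and the choice $L = 3$ are assumptions made here and do not come from the paper.

```python
import numpy as np

def entropy(p, axes):
    """Shannon entropy (bits) of the marginal of the joint pmf p on the given axes."""
    axes = tuple(sorted(set(axes)))
    drop = tuple(a for a in range(p.ndim) if a not in axes)
    m = p.sum(axis=drop) if drop else p
    m = np.atleast_1d(m)
    m = m[m > 0]
    return float(-(m * np.log2(m)).sum())

def cmi(p, a, b, c):
    """Conditional mutual information I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    a, b, c = list(a), list(b), list(c)
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def finite_length_terms(p, L):
    """Axes 0..L-1 of p hold X_1..X_L; axes L..2L-1 hold Y_1..Y_L."""
    X = list(range(L))
    Y = list(range(L, 2 * L))
    di = te = ie = rev = 0.0
    for i in range(L):  # 0-based index; corresponds to time step i+1
        di += cmi(p, [X[i]], Y[:i + 1], X[:i])        # term of I(Y -> X)
        te += cmi(p, [X[i]], Y[:i], X[:i])            # term of I(DY -> X)
        ie += cmi(p, [X[i]], [Y[i]], X[:i] + Y[:i])   # term of I(Y -> X || DY)
        rev += cmi(p, [Y[i]], X[:i], Y[:i])           # term of I(DX -> Y)
    mi = cmi(p, X, Y, [])                             # I(X_{1:L}; Y_{1:L})
    return di, te, ie, rev, mi

# Hypothetical example: a random joint pmf over binary words of length L = 3.
rng = np.random.default_rng(0)
L = 3
p = rng.random((2,) * (2 * L))
p /= p.sum()
di, te, ie, rev, mi = finite_length_terms(p, L)
print(abs(di - (te + ie)) < 1e-9, abs(di + rev - mi) < 1e-9)
```

Both checks are exact identities, so any deviation reflects floating-point rounding only.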

We can also introduce the permutation version of the
causal conditional directed information rate and prove the
corresponding properties. To be precise, let us consider
a multivariate finite-alphabet stationary stochastic process
$(X, Y, Z^1, \dots, Z^k)$ with the alphabet
$A_n \times A_m \times A_{l_1} \times \cdots \times A_{l_k}$.
The causal conditional directed information rate from Y
to X given $Z := (Z^1, \dots, Z^k)$ is defined by

$$I_\infty(Y \to X \| Z) := \lim_{L \to \infty} \frac{1}{L} I(Y_{1:L} \to X_{1:L} \| Z^1_{1:L}, \dots, Z^k_{1:L}),$$

where

$$\begin{aligned}
I(Y_{1:L} \to X_{1:L} \| Z^1_{1:L}, \dots, Z^k_{1:L})
&= H(X_{1:L} \| Z^1_{1:L}, \dots, Z^k_{1:L}) - H(X_{1:L} \| Y_{1:L}, Z^1_{1:L}, \dots, Z^k_{1:L}) \\
&= \sum_{i=1}^{L} I(X_i; Y_{1:i} \mid X_{1:i-1}, Z^1_{1:i}, \dots, Z^k_{1:i}).
\end{aligned}$$

Corresponding to Proposition 8, we have the following
equality if (X, Y, Z) is the output process of a HMM with
an ergodic internal process:

$$I_\infty(Y \to X \| Z) = I^*_\infty(Y \to X \| Z),$$

where $I^*_\infty(Y \to X \| Z)$ is the symbolic causal conditional
directed information rate, which is defined in the same manner
as the symbolic directed information rate. The following
properties also hold: assume that (X, Y, Z) is the output
process of a HMM with an ergodic internal process. Then,
we have

(i')

$$I^*_\infty(Y \to X \| Z) = \lim_{L \to \infty} I^*(X_L; Y_{1:L} \mid X_{1:L-1}, Z^1_{1:L}, \dots, Z^k_{1:L}).$$

This is the permutation version of the equality

$$I_\infty(Y \to X \| Z) = \lim_{L \to \infty} I(X_L; Y_{1:L} \mid X_{1:L-1}, Z^1_{1:L}, \dots, Z^k_{1:L}).$$

(ii')

$$I_\infty(DY \to X \| Z) = I^*_\infty(DY \to X \| Z) = \lim_{L \to \infty} I^*(X_L; Y_{1:L-1} \mid X_{1:L-1}, Z^1_{1:L}, \dots, Z^k_{1:L}).$$

The second equality is the permutation version of the
equality

$$I_\infty(DY \to X \| Z) = \lim_{L \to \infty} I(X_L; Y_{1:L-1} \mid X_{1:L-1}, Z^1_{1:L}, \dots, Z^k_{1:L}).$$

The quantities $I_\infty(DY \to X \| Z)$ and $I^*_\infty(DY \to X \| Z)$
are called the causal conditional transfer entropy rate and the
symbolic causal conditional transfer entropy rate, respectively.

(iii')

$$I_\infty(Y \to X \| DY, Z) = I^*_\infty(Y \to X \| DY, Z) = \lim_{L \to \infty} I^*(X_L; Y_L \mid X_{1:L-1}, Y_{1:L-1}, Z^1_{1:L}, \dots, Z^k_{1:L}),$$

where $I_\infty(Y \to X \| DY, Z)$ is called the causal conditional
instantaneous information exchange rate. The second
equality is the permutation version of the equality

$$I_\infty(Y \to X \| DY, Z) = \lim_{L \to \infty} I(X_L; Y_L \mid X_{1:L-1}, Y_{1:L-1}, Z^1_{1:L}, \dots, Z^k_{1:L}).$$

$I^*_\infty(Y \to X \| DY, Z)$ is the permutation version and is
called the symbolic causal conditional instantaneous
information exchange rate.

(iv')

$$I^*_\infty(Y \to X \| Z) = I^*_\infty(DY \to X \| Z) + I^*_\infty(Y \to X \| DY, Z).$$

This is the permutation version of the equality

$$I_\infty(Y \to X \| Z) = I_\infty(DY \to X \| Z) + I_\infty(Y \to X \| DY, Z).$$

(v')

$$I^*_\infty(Y \to X \| Z) + I^*_\infty(DX \to Y \| Z) = I^*_\infty(X; Y \| Z).$$

This is the permutation version of the equality

$$I_\infty(Y \to X \| Z) + I_\infty(DX \to Y \| Z) = I_\infty(X; Y \| Z),$$

where

$$I_\infty(X; Y \| Z) := \lim_{L \to \infty} \frac{1}{L} \Big[ H(X_{1:L} \| Z^1_{1:L}, \dots, Z^k_{1:L}) + H(Y_{1:L} \| Z^1_{1:L}, \dots, Z^k_{1:L}) - H(X_{1:L}, Y_{1:L} \| Z^1_{1:L}, \dots, Z^k_{1:L}) \Big]$$

is the causal conditional mutual information rate and
$I^*_\infty(X; Y \| Z)$ is its permutation version, called the symbolic
causal conditional mutual information rate. It can be
shown that

$$I_\infty(X; Y \| Z) = I^*_\infty(X; Y \| Z)$$

if (X, Y, Z) is the output process of a HMM with an
ergodic internal process.
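The decomposition of the causal conditional directed information rate into the causal conditional transfer entropy rate and the causal conditional instantaneous information exchange rate rests on one application of the chain rule for conditional mutual information at each time step. As a sketch, writing $Z_{1:L}$ as shorthand for $(Z^1_{1:L}, \dots, Z^k_{1:L})$ (a shorthand introduced here for brevity, not taken from the original text):

```latex
\begin{aligned}
I(X_L; Y_{1:L} \mid X_{1:L-1}, Z_{1:L})
  &= I(X_L; Y_{1:L-1} \mid X_{1:L-1}, Z_{1:L}) \\
  &\quad + I(X_L; Y_L \mid X_{1:L-1}, Y_{1:L-1}, Z_{1:L}).
\end{aligned}
```

Letting $L \to \infty$ and using the limit expressions given above for each of the three rates yields the identity for the non-permutation quantities.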

V. Discussion 

In this section, we discuss how our theoretical results in this 
paper are related to the previous work in the literature. 

When confronted with real time series data, we cannot
take the limit of large word length. Hence, we have to
estimate information rates with words of finite length. In such a
situation, one permutation method could have some advantages
over the others. As a matter of fact, TERV
was originally proposed as an improved version of STE [34].
However, it has been unclear whether they coincide in the limit
of large permutation length. In this paper, we provide a
partial answer to this question: the two permutation versions
of the transfer entropy rate, the rate of STE and the rate of
TERV, are equivalent to the transfer entropy rate for bivariate
processes generated by HMMs with ergodic internal processes.
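To make the finite-length setting concrete, the following Python sketch computes a plug-in estimate of the simplest permutation quantity, the permutation entropy per symbol, from a finite-alphabet time series at a fixed window length $L$. It is not an implementation of STE or TERV. The function names are hypothetical, and the tie-breaking rule (equal values ordered by time index) is an assumption, chosen because finite alphabets produce many ties.

```python
from collections import Counter
from math import log2

def ordinal_pattern(word):
    """Indices that sort the word in ascending order.

    Python's sort is stable, so equal values keep their time order
    (assumed tie-breaking convention).
    """
    return tuple(sorted(range(len(word)), key=lambda i: word[i]))

def permutation_entropy_per_symbol(series, L):
    """Plug-in estimate of H*(X_{1:L}) / L in bits from sliding windows."""
    windows = [tuple(series[t:t + L]) for t in range(len(series) - L + 1)]
    counts = Counter(ordinal_pattern(w) for w in windows)
    n = sum(counts.values())
    h = -sum(c / n * log2(c / n) for c in counts.values())
    return h / L

# Hypothetical usage on a short binary series.
series = [0, 1, 0, 1, 0, 1, 0, 1, 0]
print(permutation_entropy_per_symbol(series, 2))
```

With real data the window length $L$ is bounded by the sample size, since the number of possible permutations grows as $L!$; this is the practical reason the limit of large $L$ cannot be taken.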

The Granger causality graph [37] is a model of the causal
dependence structure in multivariate stationary stochastic processes.
Given a multivariate stationary stochastic process, the nodes of
a Granger causality graph are the components of the process.
There are two types of edges: directed and undirected.
The absence of a directed edge from one node to
another indicates the lack of a Granger cause from the
former to the latter relative to the remaining processes.
Similarly, the absence of an undirected edge between two nodes
indicates the lack of an instantaneous cause between them
relative to the remaining processes. Amblard and Michel
[31], [36] proposed that the Granger causality graph can be
constructed based on directed information theory: let
$X = (X^1, X^2, \dots, X^m)$ be a multivariate finite-alphabet stationary
stochastic process with the alphabet
$A_{n_1} \times A_{n_2} \times \cdots \times A_{n_m}$
and let $(V, E_d, E_u)$ be the Granger causality graph of the process
X, where $V = \{1, 2, \dots, m\}$ is the set of nodes, $E_d$ is the set
of directed edges and $E_u$ is the set of undirected edges. Their
proposal is that

(i) for any $i, j \in V$, $(i, j) \notin E_d$ if and only if
$I_\infty(DX^i \to X^j \| X \setminus \{X^i, X^j\}) = 0$,

(ii) for any $i, j \in V$, $(i, j) \notin E_u$ if and only if
$I_\infty(X^i \to X^j \| DX^i, X \setminus \{X^i, X^j\}) = 0$.

Thus, in the Granger causality graph construction proposed in
[31], [36], the causal conditional transfer entropy rate captures
the Granger cause from one process to another process relative
to the other remaining processes. On the other hand, the causal
conditional instantaneous information exchange rate captures
the instantaneous cause between two processes relative to the
other remaining processes.
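As an illustration of this construction, the sketch below builds the two edge sets from tables of rate values. It assumes the rates have already been estimated by some means and replaces the exact condition "rate equals 0" with a tolerance, since estimated values are never exactly zero. The function name, the 0-based node indices, and the parameter `tol` are hypothetical choices made here and are not part of the proposal in [31], [36].

```python
def granger_graph(te_rate, ie_rate, tol=1e-6):
    """Build directed and undirected edge sets from rate tables.

    te_rate[i][j]: causal conditional transfer entropy rate from node i to node j.
    ie_rate[i][j]: causal conditional instantaneous information exchange rate
                   between nodes i and j.
    An edge is kept when the corresponding rate exceeds tol (assumed threshold).
    """
    m = len(te_rate)
    directed = {
        (i, j)
        for i in range(m)
        for j in range(m)
        if i != j and te_rate[i][j] > tol
    }
    undirected = {
        frozenset((i, j))
        for i in range(m)
        for j in range(i + 1, m)
        if max(ie_rate[i][j], ie_rate[j][i]) > tol
    }
    return directed, undirected

# Hypothetical rate tables for three processes.
te = [[0.0, 0.2, 0.0],
      [0.0, 0.0, 0.0],
      [0.1, 0.0, 0.0]]
ie = [[0.0, 0.0, 0.05],
      [0.0, 0.0, 0.0],
      [0.05, 0.0, 0.0]]
E_d, E_u = granger_graph(te, ie)
print(sorted(E_d), [sorted(e) for e in E_u])
```

How to choose the tolerance, or to replace it with a statistical test, is left open here.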

Now, let us consider the case when X is the output process
of a HMM with an ergodic internal process. Then, from the
results of Section IV-D, we have

(i') for any $i, j \in V$, $(i, j) \notin E_d$ if and only if
$I^*_\infty(DX^i \to X^j \| X \setminus \{X^i, X^j\}) = 0$,

(ii') for any $i, j \in V$, $(i, j) \notin E_u$ if and only if
$I^*_\infty(X^i \to X^j \| DX^i, X \setminus \{X^i, X^j\}) = 0$.

Thus, the Granger causality graphs in the sense of [31],
[36] for multivariate processes generated by HMMs with
ergodic internal processes can be captured by the language
of the permutation entropy: the symbolic causal conditional
transfer entropy rate and the symbolic causal conditional
instantaneous information exchange rate. This statement opens up
the possibility of a permutation approach to the problem of assessing
the causal dependence structure of multivariate stationary
stochastic processes. However, the details of the
practical implementation remain an issue for further study.